RNA-Seq Data Analysis ◾ 173
overlap with multiple features (non-unique). A read that maps or overlaps with multiple
features is considered as ambiguous and will not be counted. Only reads mapping unam-
biguously to a single feature are counted.
For RNA-Seq read counts, we can use a variety of programs but the most used free open-
source programs include HTSeq-count [19] (Python-based program) and FeatureCounts
[20], which is used as a Linux command-line program or as a function in the R package
Rsubread. Both these programs require BAM files and GFT file (reference annotation file)
as inputs.
In the following, we will use HTSeq-count to count RNA-seq reads aligned to the genes
in chromosome 22. Install HTSeq-count by following the installation instructions avail-
able at “https://htseq.readthedocs.io/en/master/install.html”. The following HTSeq-count
command counts aligned reads in all BAM files stored in the “bam” subdirectory and saves
the output in “features/htcount.txt”. The “-m union” option specifies the union mode to
handle reads overlapping more than one feature. We used “--additional-attr=transcript_
id” to add transcript accession numbers to the output.
mkdir features
htseq-count \
-m union \
-f bam \
--additional-attr=transcript_id \
-s yes bam/*.bam \
gtf/hg38.ncbiRefSeq.gtf \
> features/htcount.txt
sed ‘/^__/ d’ < htcount.txt > htcount2.txt
The “sed” command has been used to remove the last rows that begin with “__” and to save
that change in a new file “htcount2.txt”, which we will use in the next step of the analysis.
FIGURE 5.3 HTSeq-count feature count.